ABSTRACT


This paper describes a fast algorithm and 
implementation of code excited linear predictive 
(CELP) speech coding. It presents principles of the 
algorithm, including (i) fast conversion of line spectrum 
pair parameters to linear predictive coding parameters, 
and (ii) fast searches of the parameters of adaptive and 
stochastic codebooks. The algorithm can be readily 
used for speech compression applications, such as (i) 
high-quality, low-bit-rate speech transmission in 
point-to-point or store-and-forward (network-based) mode, 
and (ii) efficient speech storage in speech recording or 
multimedia databases. The implementation performs in 
real-time and near real-time on various platforms, 
including an IBM-PC AT equipped with a TMS320C30 
module, an IBM PC 486, a SUN Sparcstation 2, a SUN 
Sparcstation 5, and an IBM Power PC (Power 590). 


1. INTRODUCTION 
1.1. Why is CELP Useful?


Obtaining efficient representation of speech at low bit 
rates for communication or storage has been a problem 
of considerable importance, because of technical as well 
as economic requirements. Telephone-quality digital 
speech in pulse code modulation (PCM) form requires 
a 64 kbits/s rate, which cannot be transmitted in real time 
through the 6 kHz and 30 kHz channel capacities of the HF 
and VHF bands, respectively. Voice mail and multimedia 
employ speech storage, demanding efficient ways of 
storing speech, since one minute of PCM speech already 
requires 480 kbytes of storage space. Even if the 
channel can accommodate real-time speech, speech 
compression allows more communication connections 
to share the precious channel. Similarly, speech 
compression allows more speech messages to be stored 
in storage of the same size. 


This paper describes a speech compression technique 
for those purposes, called code-excited linear predictive 
(CELP) coding [Atal86] [JaJS93], which obtains bit 
rates as low as 4.8 kbits/s, giving a compression ratio 
of up to 13:1 [CaTW90]. Although this rate is higher 
than that of 2.4 kbits/s linear predictive coding (LPC), 
speech compressed by CELP has the quality, naturalness, 
and speaker recognizability that are missing from 
LPC. 
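
The storage and compression figures quoted above can be checked with a few lines of arithmetic. The following is a minimal sketch; the constant and function names are illustrative and do not come from the paper.

```python
# Rates quoted in the text, in bits per second.
PCM_RATE_BPS = 64_000   # telephone-quality PCM
CELP_RATE_BPS = 4_800   # CELP


def storage_bytes(rate_bps, seconds):
    """Bytes needed to store `seconds` of speech at `rate_bps`."""
    return rate_bps * seconds // 8


pcm_minute = storage_bytes(PCM_RATE_BPS, 60)     # one minute of PCM speech
celp_minute = storage_bytes(CELP_RATE_BPS, 60)   # one minute of CELP speech
compression_ratio = PCM_RATE_BPS / CELP_RATE_BPS

print(pcm_minute, celp_minute, round(compression_ratio, 1))
```

One minute of PCM speech comes to 480,000 bytes, matching the 480 kbytes stated in Section 1.1, and the ratio of the two rates is slightly above 13.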

The importance of CELP goes beyond its quality vs. 
bit-rate performance, as it provides a generic structure 
for future generations of perceptual speech coders 
[JaJS93]. All speech compression techniques have been 
based on two intrinsic operations: removal of 
redundancy and removal of irrelevancy. The first 
operation uses prediction and/or transforms to remove 
redundant data, thus reducing the bit rates. The second 
operation further reduces the bit rates through 
quantization of (i) the time components of the prediction 
error or (ii) the transform coefficients, allowing 
mathematically non-zero but imperceptible 
reconstruction error or distortion. 

If further compression is still required, the coder 
minimizes the error perceptibility by exploiting masking 
properties of human speech perception. To a certain 
extent, the speech energy itself perceptually masks the 
distortion. Thus the same energy level of distortion 
has different perceptual effects when applied to speech 
signals with different energy levels. This approach 
promises a new level of higher quality and lower bit rate 
speech compression [JaJS93]. Coders that minimize 
perceptual distortion (such as CELP) are called 
perceptual coders. 

One novelty of CELP is in incorporating the masking 
property in a working, practical scheme. Such 
incorporation is nontrivial because perceptual distortion 
measures lack the tractable means that have often been 
available in the traditional distortion energy measure. 
The CELP solution to this problem is to use an 
analysis-by-synthesis approach, where the perceptual 
distortion is literally measured. CELP then exploits the 
computational structure, resulting in a sophisticated, 
practical compression technique. Clearly, the 
computational cost is very high. 


1.2. Conceptual CELP 


As shown in Fig. 1, a conceptual CELP structure 
[ScAt85] consists of: 

a. two predictors (pitch and spectral predictor filters) to 
remove redundancy caused by long and short term 
correlations among speech samples, respectively; 
and 

b. a closed-loop, perceptual vector quantizer utilizing a 
codebook to remove irrelevancy indirectly from the 
time components of the prediction error. 

The codebook stores random (stochastic) signals as 
prototypes of excitation signals for the two predictor 
filters. Furthermore, a perceptual weighting filter 
ensures that mean-square error measurement reflects the 
perceptual error measurement. 

The CELP compressed speech then consists of: 

a. a set of spectral predictor parameters; 

b. a set of pitch predictor parameters; and 

c. codebook (entry and gain) parameters. 

It is these CELP parameters that can be transmitted or 
stored at rates as low as 4.8 kbits/s. 

The speech compression algorithm begins by 
obtaining the predictor parameters, and then searching 
for the codebook parameters corresponding to the excitation 
prototype that minimizes the perceptual error. The 
CELP decompressor uses the codebook parameters to 
produce the excitation signal, which excites the cascade of 
pitch and spectral filters, resulting in the 
decompressed speech. 

The selection of the predictors and the quantizer is by 
no means arbitrary. They match elements of a model of 
human speech production system [Lang92]. The model 
consists of an excitation source and a vocal tract. 
During voiced speech articulations, the excitation source 
produces quasi-periodic pulses which excite the vocal 
tract. The pulses are subjected to resonance and 
anti-resonance processes in the vocal tract according to the 
changes in the vocal tract shape over time, resulting in 
audible and meaningful speech. Similar processes take 
place during stop and fricative articulations. However, 
the excitation source should produce noise-like 
excitations instead. In matching the model, CELP uses 
the spectral predictor filter to perform vocal tract 
function. The pitch predictor filter (usually a one-tap, 
all-pole filter) ensures the quasi-periodicity of the 
spectral filter excitation. In this cascaded filter 
structure, it is known that voiced speech signals have 
excitations of Gaussian distribution. Thus the codebook 
members represent such excitations. It also 
accommodates excitation for stops and fricatives. The 
fact that the CELP structure serves both signal 
compression principles (i.e., redundancy and 
irrelevancy removal) and the speech production model 
(i.e., an articulation source and vocal tract) is the reason 
for CELP's highly successful performance. 


1.3. Implementation Problem 


Despite its conceptual maturity, real-time CELP 
implementation is still a complex problem. The 
codebook search is so computationally demanding 
that a direct implementation requires a very long 
computation time, far beyond the real-time 
requirement. In the searching process, each prototype 
must go through three filtering processes (the pitch, 
spectral, and perceptual filters) and one mean-square 
computation. It is easy to show that a brute-force approach 
would require a processor with more than 34 million MIPS 
for real-time CELP [Lang92]. An early 'practical' CELP 
implementation required 125 s of Cray-1 computation 
time to process one second of speech [ScAt85], while a 
real-time procedure must process one second of speech in 
one second or less. 

Fig. 1. Conceptual CELP analyzer. 

Thus, a practical CELP system must employ fast 
algorithms, which exploit the computational structure of 
a CELP scheme. In the process of developing practical 
CELP, the actual structure becomes significantly 
different from the conceptual one, while still performing 
the same functions (see [Lang92] for details on the 
transition). For example, the spectral parameters are 
quantized and represented now by a set of line-spectrum 
pairs (LSP) [SoJu84]. The pitch filter becomes another 
codebook, called the adaptive codebook (ACB). The 
codebook of the random signals is then called the 
stochastic codebook (SCB). 

Unfortunately, the fast algorithm has significantly 
increased the implementation complexity as the 
optimization blurs the structure in favor of speed. The 
algorithm now combines the spectral predictor and the 
perceptual weighting filter into one filter. A joint 
optimization scheme searches for a suboptimal 
combination of codebook parameters, instead of the optimal 
combination through a total exhaustive search of all 
combinations, as implied by the conceptual structure. 
The use of a special SCB results in a fast iterative 
search, in which the results of the perceptual distortion 
calculation for the current prototype help the calculation 
for the next prototype. It should be noted that 
although there is a proposed U.S. Federal Standard (FS) 
1016 CELP [CaTW90] which describes each bit in the 
compressed speech, it does not specify how to obtain the 
compressed speech, leaving it to CELP implementors to 
develop a method. 


1.4. Paper Overview 


The remaining part of this paper describes a practical, 
near real-time CELP algorithm, which reduces the 
computational power requirement by a factor of more 
than 175,000. Section 2 describes the procedures to 
compress and decompress speech. This paper focuses 
mainly on the description of algorithms compatible with 
FS-1016 to enable communication with other FS-1016 
systems. In Section 3, we briefly explain the 
actual computer implementation, resulting in 
performance ranging from 14 to 0.85 of real time, 
depending on the platform. The algorithm has been 
implemented on an IBM PC-AT equipped with a 
TMS320C30 (C30) evaluation module (EVM) 
[LaKi91], [Lang92]. The system is suitable for PC-based 
packet radio or speech recording systems. The 
algorithm has also been ported to various UNIX 
platforms as well as the MS Windows 3.1 platform for 
voice mail development. Section 4 discusses the 
performance of the various implementations, including 
their limitations. Finally, Section 5 provides 
conclusions. 


2. FAST CELP PROCEDURES 


2.1. Input and Output 


In practice, CELP is a block coding scheme, in which a frame 
of 240 PCM speech samples s[n] (a total of 1.92 
kbits), denoted as a vector s, is converted to 144 bits of 
compressed data, called the FS-1016 CELP parameters or 
data stream. The CELP parameters now consist of: 

a. the line spectrum pair (LSP) parameters; 

b. the adaptive codebook (ACB) parameters; and 

c. the stochastic codebook (SCB) parameters. 

All LSP, ACB, and SCB parameters are entries 
(indexes) of quantization tables and codebooks, namely the 
LSP table, ACB, ACB gain table, SCB, and SCB gain 
table [LaKi90], [LaKi91]. Together they require only 138 
bits. The remaining 6 bits can be used for error 
correction, synchronization, and future expansion. 
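
The frame figures above are consistent with the 4.8 kbits/s rate quoted in Section 1. The following sketch checks this; the 8 kHz sampling rate and 8 bits per PCM sample are assumptions (they are implied by 64 kbits/s PCM and by 240 samples amounting to 1.92 kbits, but are not stated explicitly here).

```python
SAMPLE_RATE_HZ = 8000        # assumption: 64 kbits/s PCM at 8 bits per sample
BITS_PER_PCM_SAMPLE = 8      # assumption, see above
FRAME_SAMPLES = 240          # samples per frame (from the text)
FRAME_BITS = 144             # compressed bits per frame (from the text)
PARAMETER_BITS = 138         # bits used by LSP, ACB, and SCB parameters

frame_ms = 1000 * FRAME_SAMPLES / SAMPLE_RATE_HZ            # frame duration
pcm_frame_bits = FRAME_SAMPLES * BITS_PER_PCM_SAMPLE        # uncompressed frame
celp_rate_bps = FRAME_BITS * SAMPLE_RATE_HZ / FRAME_SAMPLES # compressed rate
spare_bits = FRAME_BITS - PARAMETER_BITS                    # left for other uses

print(frame_ms, pcm_frame_bits, celp_rate_bps, spare_bits)
```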

Naturally, the CELP procedures should form a 
CELP compressor and decompressor system, extracting 
CELP parameters from s[n] and reconstructing s 
from the FS-1016 data stream. Specifically, a CELP 
compressor (usually called the analyzer) requires (i) an LSP 
analysis procedure to obtain the LSP parameters, and 
(ii) a codebook search procedure for both ACB and SCB 
parameters, while a CELP decompressor (usually called the 
synthesizer) requires a speech synthesis procedure. We 
describe the procedures as follows. 


2.2. LSP Analysis 


The CELP analyzer obtains the LSP parameters 
through the following three steps: (i) performing linear 
predictive coding (LPC) analysis on the PCM samples 
to represent spectral information [Pars86], [Proa83], (ii) 
converting the LPC parameters into LSP parameters 
[KaRa86], [HaHe90] for efficient representation, and 
(iii) ensuring LSP parameter stability. 


2.2.1. LPC Analysis 
The aim of LPC analysis is to obtain the LPC parameters 
$a_i$ (collectively denoted as a) corresponding to the 
spectral filter. The spectral (or LPC) filter models the 
human vocal tract. The most common model is a 
10th-order all-pole digital filter H(z) with ten 
coefficients $a_i$, as follows 

$$H(z) = \frac{1}{1 + \sum_{i=1}^{10} a_i z^{-i}} \tag{1}$$

Let the input (excitation) of this filter be a zero-mean 
signal t. The output of this filter is then $\hat{s}$, 
according to (in z-domain notation) 

$$\hat{S}(z) = H(z)\,T(z). \tag{2}$$

For a given s, the LPC analysis finds a that minimizes 
$\|s - \hat{s}\|$. The elements ($a_i$) of such a vector a 
are the LPC parameters, which are the solutions of the 
linear equation system 

$$0 = \sum_{i=0}^{10} a_i\, r_{|i-k|}; \quad k = 1, \ldots, 10 \tag{3}$$

where $a_0 = 1$ and $r_i$ are autocorrelation terms defined as 

$$r_i = \sum_{n=i}^{N-1} s[n]\, s[n-i]$$
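
As an illustration of Eq. (3), the following sketch computes the autocorrelation terms of a synthetic frame and solves the equation system with the Levinson-Durbin recursion. The recursion is one standard way of solving such a Toeplitz system; the paper does not say which solver its implementation uses, and the test signal and function names are assumptions made for this example.

```python
import math
import random


def autocorrelation(s, max_lag):
    """r_i = sum_{n=i}^{N-1} s[n] s[n-i] for i = 0..max_lag."""
    n = len(s)
    return [sum(s[k] * s[k - i] for k in range(i, n)) for i in range(max_lag + 1)]


def levinson_durbin(r, order):
    """Solve sum_{i=0}^{order} a_i r_|i-k| = 0 (k = 1..order) with a_0 = 1."""
    a = [1.0] + [0.0] * order
    err = r[0]
    for m in range(1, order + 1):
        acc = r[m] + sum(a[i] * r[m - i] for i in range(1, m))
        k = -acc / err                      # reflection coefficient
        prev = a[:]
        for i in range(1, m):
            a[i] = prev[i] + k * prev[m - i]
        a[m] = k
        err *= 1.0 - k * k                  # remaining prediction error energy
    return a, err


# Synthetic 240-sample frame: two sinusoids plus a little noise.
rng = random.Random(0)
frame = [math.sin(0.3 * n) + 0.5 * math.sin(1.1 * n) + 0.1 * rng.gauss(0.0, 1.0)
         for n in range(240)]
r = autocorrelation(frame, 10)
a, err = levinson_durbin(r, 10)

# Residual of Eq. (3) for each k; all values should be close to zero.
residuals = [sum(a[i] * r[abs(i - k)] for i in range(11)) for k in range(1, 11)]
```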
2.2.2. LPC to LSP Parameter Conversion 

The system must quantize a using the LSP analysis, 
since the $a_i$ are 10 real numbers which require too many bits 
(10x16 = 160 bits) for representation. On the other hand, 
the LSP parameters (we call them $LSP_j$) are more 
efficient (only 34 bits) because they are ten integers 
ranging from 0 to 7 (or to 15), corresponding to the 
entries of a suitable LSP table. 


To show the conversion, we first show that a can be 
represented by $z_i$, which are the zeros of two polynomials 
p(z) and q(z) related through 

$$A(z) = \tfrac{1}{2}\bigl(p(z) + q(z)\bigr) \tag{4}$$

$$p(z) = A(z) + z^{-11} A(z^{-1}), \qquad q(z) = A(z) - z^{-11} A(z^{-1}) \tag{5}$$

where $A(z) = 1/H(z) = \sum_{i=0}^{10} a_i z^{-i}$ with $a_0 = 1$. 
Clearly, polynomials p(z) and q(z) represent H(z). In 
other words, the zeros $z_i$ of p(z) and q(z) (eleven each) can 
represent a. 

Furthermore, $z_i$ can be represented fully by $\omega_i$ as 

$$\omega_i = \arg(z_i); \quad i = 0, \ldots, 9 \tag{6}$$

where arg(·) is the argument of a complex variable. The 
proof relies on the fact that z = -1 and z = 1 are always 
zeros of p(z) and q(z), respectively, which follows 
directly from Eq. (5). Thus the 20 
remaining zeros are sufficient to represent p(z) and q(z). 
Furthermore, all $z_i$ are symmetric about the real axis, 
and lie on the unit circle in the z-plane. Thus, 10 zeros 
(below the real line) are actually redundant, leaving us 
with the remaining 10 significant zeros, which uniquely 
correspond to 10 values of $\omega_i$ through Eq. (6). 
Furthermore, it can be shown that $\omega_i$ with even and odd 
i correspond to p(z) and q(z), respectively. We then 
conclude that these 10 values of $\omega_i$ can reconstruct all 
the zeros of p(z) and q(z), thus representing a. 
Equivalently, for a given a, we can always derive such 
$\omega_i$. 
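
The structural properties used in this argument can be checked numerically. The sketch below builds the coefficient lists of p(z) and q(z) from an arbitrary set of ten coefficients and verifies that A(z) is their half-sum, that z = -1 is a zero of p(z), and that z = 1 is a zero of q(z). The coefficient values and function names are illustrative assumptions, not data from the paper.

```python
def lsp_polynomials(a):
    """Coefficients of p(z) and q(z) in powers of z^-1 (index k holds z^-k).

    `a` holds a_1..a_10; a_0 = 1 is implied. The term z^-11 A(1/z) has
    coefficient a_(11-k) at z^-k, with a_11 taken as 0.
    """
    ext = [1.0] + list(a) + [0.0]            # a_0 .. a_11
    p = [ext[k] + ext[11 - k] for k in range(12)]
    q = [ext[k] - ext[11 - k] for k in range(12)]
    return p, q


def evaluate(coeffs, z):
    """Evaluate sum_k coeffs[k] * z^-k."""
    return sum(c * z ** (-k) for k, c in enumerate(coeffs))


# Arbitrary illustrative coefficients a_1..a_10.
a = [-1.2, 0.8, -0.3, 0.2, -0.1, 0.05, 0.02, -0.01, 0.005, 0.001]
p, q = lsp_polynomials(a)

half_sum = [(pk + qk) / 2 for pk, qk in zip(p, q)]   # should equal a_0..a_11
p_at_minus_one = evaluate(p, -1.0)                   # should vanish
q_at_plus_one = evaluate(q, 1.0)                     # should vanish
```

The coefficient lists also show why the zeros come in conjugate pairs on a symmetric pattern: p is palindromic and q is anti-palindromic.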


Having obtained $\omega_j$, we can efficiently represent them 
through quantization. Although we can directly 
quantize $a_i$, the dynamic range of $a_i$ is high (i.e., there are 
many significant values of $a_i$), requiring many 
quantization steps to achieve low quantization error. On 
the other hand, each $\omega_j$ has a much more limited dynamic 
range, since the ranges of $\omega_j$ are disjoint subintervals $S_j$ 
in the real-number interval from 0 to π, i.e., 

$$\omega_j \in S_j; \quad \bigcup_j S_j \subset [0, \pi); \quad i \ne j \Rightarrow (S_i \cap S_j = \emptyset); \quad j = 0, \ldots, 9 \tag{7}$$

Thus, fewer quantization steps for $\omega_j$ can achieve the 
same quantization error. 

We then use the FS-1016 LSP table to quantize $\omega_j$. 
For each $\omega_j$, FS-1016 sets a list of 8 possible quantized 
values (or 16 for the second through fifth lists, j = 1 to 4), 
covering $S_j$ and its neighborhood. Thus, there are 10 lists, 
namely list j, j = 0 to 9, collectively called the FS-1016 
LSP table. Let LSPTable[j,i] be a particular quantized value 
indexed by i in list j, where i is from 0 to 7, or to 15. We 
quantize $\omega_j$ by selecting i such that LSPTable[j,i] is the 
closest value to $\omega_j$ in list j. Now, assigning such an i to 
$LSP_j$ and performing similar steps for all j, we have $LSP_j$ 
as a representation of the quantized $\omega_j$. We call those 
$LSP_j$ the LSP parameters, which can now represent a. 
This representation is efficient because we only need 
3+4+4+4+4+3+3+3+3+3 = 34 bits for each a, instead of 
160 bits in the original floating-point form. 
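
The nearest-value quantization described above can be sketched as follows. The table built here is a toy stand-in with evenly spaced levels and the list sizes given in the text; it is not the actual FS-1016 LSP table, and all names are hypothetical.

```python
import math

# List sizes per j as described in the text (8 or 16 levels).
SIZES = [8, 16, 16, 16, 16, 8, 8, 8, 8, 8]


def toy_table():
    """Hypothetical table: list j covers the j-th tenth of [0, pi)."""
    width = math.pi / 10
    return [[j * width + (i + 0.5) * width / n for i in range(n)]
            for j, n in enumerate(SIZES)]


def quantize_lsp(omegas, table):
    """For each omega_j, return the index i of the closest value in list j."""
    indices = []
    for omega, levels in zip(omegas, table):
        best = min(range(len(levels)), key=lambda i: abs(levels[i] - omega))
        indices.append(best)
    return indices


def bits_needed(table):
    """Total bits needed to send one index per list."""
    return sum((len(levels) - 1).bit_length() for levels in table)


table = toy_table()
omegas = [levels[2] + 0.001 for levels in table]   # slightly off level 2
lsp = quantize_lsp(omegas, table)
total_bits = bits_needed(table)
```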

One advantage of using the FS-1016 LSP table is that 
we can derive a fast LSP conversion algorithm, by 
searching the table without actually knowing the exact 
zeros. There are numerical methods such as 
Newton-Raphson and Jenkins-Traub [PTVF92] for finding the 
zeros of p(z) and q(z), but they are tedious. 
Furthermore, the exact $\omega_j$ must later be quantized 
anyway. 


A different and faster approach is to check for zero 
crossings of a new pair of polynomials p(x) and q(x). 
These polynomials are related to p(z) and q(z) by the fact 
that their zeros, $x_i$, are 

$$x_i = \cos \omega_i \tag{8}$$

Such p(x) and q(x) must then take the form 

$$p(x) = \sum_{i=0}^{5} b_i x^i, \qquad q(x) = \sum_{i=0}^{5} c_i x^i \tag{9}$$

where the coefficients b and c are 

$$\begin{aligned}
b_5 &= 32 \\
b_4 &= 16 p_1 \\
b_3 &= 8 (p_2 - 5) \\
b_2 &= 4 (p_3 - 4 p_1) \\
b_1 &= 2 (p_4 - 3 p_2 + 5) \\
b_0 &= p_5 - 2 p_3 + 2 p_1
\end{aligned} \tag{10}$$

and 

$$\begin{aligned}
c_5 &= 32 \\
c_4 &= 16 q_1 \\
c_3 &= 8 (q_2 - 5) \\
c_2 &= 4 (q_3 - 4 q_1) \\
c_1 &= 2 (q_4 - 3 q_2 + 5) \\
c_0 &= q_5 - 2 q_3 + 2 q_1
\end{aligned} \tag{11}$$

Here, $p_i$ and $q_i$ are coefficients of p(z) and q(z), 
respectively, after the fixed zeros at z = -1 and z = 1 
have been factored out, where i refers to the polynomial 
term containing $z^{-i}$. Both $p_0$ and $q_0$ are always equal 
to one. For a given a, it is easy to show from Eq. (5) that the 
remaining $p_i$ and $q_i$ can be obtained recursively through 
a loop of i from 1 to 5 of 

$$p_i = a_i + a_{11-i} - p_{i-1}, \qquad q_i = a_i - a_{11-i} + q_{i-1} \tag{12}$$


The fast LSP conversion then uses the fact that each x 
associated with a zero of p(z) or q(z) causes p(x) or q(x) 
to be zero, respectively. Thus, the scheme applies the 
values of x corresponding to ω in the LSP table (i.e., 
x = cos(LSPTable[j,i])) to the polynomials p(x) and q(x), 
and looks for zero crossings. As before, even and odd j 
correspond to p(x) and q(x), respectively. For each j, 
the scheme then assigns a certain i to $LSP_j$, such that 
LSPTable[j,i] is the closest table value within the same j 
at which a zero crossing of p(x) or q(x) occurs. 
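
The chain from Eq. (12) through Eqs. (10)-(11) to a zero-crossing test can be sketched as follows. To make the example checkable, the LPC coefficients are synthesized from known, interlaced LSP angles (an assumption made purely for illustration); the grid of candidate angles stands in for one list of the LSP table, and all function names are hypothetical.

```python
import math


def polymul(u, v):
    """Multiply two polynomials given as coefficient lists in z^-1."""
    out = [0.0] * (len(u) + len(v) - 1)
    for i, ui in enumerate(u):
        for j, vj in enumerate(v):
            out[i + j] += ui * vj
    return out


def from_angles(angles):
    """Product of (1 - 2 cos(w) z^-1 + z^-2): zeros at exp(+-jw)."""
    poly = [1.0]
    for w in angles:
        poly = polymul(poly, [1.0, -2.0 * math.cos(w), 1.0])
    return poly


def deflated_coeffs(a):
    """Eq. (12): p_0..p_5 and q_0..q_5 from a_1..a_10."""
    ext = [1.0] + list(a)                    # ext[i] = a_i, a_0 = 1
    p, q = [1.0], [1.0]
    for i in range(1, 6):
        p.append(ext[i] + ext[11 - i] - p[i - 1])
        q.append(ext[i] - ext[11 - i] + q[i - 1])
    return p, q


def cosine_poly(c):
    """Eqs. (10)/(11): coefficients [b_0..b_5] of the polynomial in x."""
    return [c[5] - 2 * c[3] + 2 * c[1], 2 * (c[4] - 3 * c[2] + 5),
            4 * (c[3] - 4 * c[1]), 8 * (c[2] - 5), 16 * c[1], 32.0]


def horner(b, x):
    """Evaluate sum_i b[i] x^i."""
    acc = 0.0
    for coef in reversed(b):
        acc = acc * x + coef
    return acc


def zero_crossing_index(b, omegas):
    """First index i with a sign change between omegas[i-1] and omegas[i]."""
    prev = horner(b, math.cos(omegas[0]))
    for i in range(1, len(omegas)):
        cur = horner(b, math.cos(omegas[i]))
        if prev * cur <= 0.0:
            return i
        prev = cur
    return None


# Synthetic data: known LSP angles for p (even) and q (odd), interlaced.
p_angles = [0.3, 0.9, 1.5, 2.1, 2.7]
q_angles = [0.6, 1.2, 1.8, 2.4, 2.95]
p_full = polymul(from_angles(p_angles), [1.0, 1.0])    # restore zero at z = -1
q_full = polymul(from_angles(q_angles), [1.0, -1.0])   # restore zero at z = 1
big_a = [(x + y) / 2 for x, y in zip(p_full, q_full)]  # Eq. (4)

p5, q5 = deflated_coeffs(big_a[1:11])
b, c = cosine_poly(p5), cosine_poly(q5)
hit = zero_crossing_index(b, [0.25, 0.28, 0.31, 0.34])  # brackets omega = 0.3
```

In this toy setup the sign change is found between the second and third grid angles, which bracket the true value 0.3.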


2.2.3. Ensuring LSP Stability 


We must have a scheme for robust representation of 
the LPC parameters, because they are very sensitive and 
the conversion to LSP parameters increases the 
sensitivity. Since H(z) is a recursive filter, a distortion in 
a can easily move the poles of H(z) outside the unit 
circle of the z-plane, resulting in an unstable H(z). The 
conversion to LSP introduces further distortion 
due to quantization errors. 

Fortunately, if the ordered values of $\omega_j$ are 
monotonically increasing (from 0 to π), the LSP method 
guarantees the stability of H(z) [SoJu84]. Thus, before 
transmitting the $LSP_j$, the scheme verifies the ordered 
values of $\omega_j$ corresponding to $LSP_j$. If the ordered 
values violate monotonicity, the scheme replaces them 
with a stable set of $LSP_j$ from the previous frame. 
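
A minimal sketch of this monotonicity check is given below. The three-list table is a toy example rather than the FS-1016 table, and the function names are hypothetical.

```python
import math


def is_monotonic(omegas):
    """True if the values increase strictly and stay inside (0, pi)."""
    inside = all(0.0 < w < math.pi for w in omegas)
    increasing = all(x < y for x, y in zip(omegas, omegas[1:]))
    return inside and increasing


def select_lsp(current, previous, table):
    """Keep the current indices if they decode to a monotonic set,
    otherwise fall back to the indices of the previous frame."""
    omegas = [table[j][i] for j, i in enumerate(current)]
    return current if is_monotonic(omegas) else previous


# Toy table with three lists of two levels each (illustration only).
table = [[0.2, 0.4], [0.3, 0.6], [0.5, 0.9]]
previous = [0, 0, 0]                               # 0.2, 0.3, 0.5
kept = select_lsp([0, 1, 1], previous, table)      # 0.2, 0.6, 0.9
replaced = select_lsp([1, 0, 0], previous, table)  # 0.4, 0.3, 0.5
```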


Sometimes, the pre-defined quantization steps can also 
create a stability problem. There are cases when some 
adjacent $\omega_j$ are so close together that, for the given 
resolution, the table fails to distinguish them. Or, the 
$\omega_j$ may lie beyond the table coverage. In this situation, 
the fast LSP conversion usually gives incorrect, unstable 
$LSP_j$. One way to avoid such cases is to expand the 
bandwidth of a prior to the LSP conversion process. Thus, 
instead of using a, the scheme uses c, defined as 

$$c_i = a_i \gamma^i \tag{13}$$

where γ is the expanding factor (typically set to 0.994), 
and i is an index from 1 to 10. 

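
Eq. (13) amounts to one multiplication per coefficient, as the following sketch shows. The function name and the sample coefficient values are illustrative.

```python
def expand_bandwidth(a, gamma=0.994):
    """Eq. (13): c_i = a_i * gamma**i for i = 1..len(a)."""
    return [coef * gamma ** i for i, coef in enumerate(a, start=1)]


a = [-1.2, 0.8, -0.3, 0.2, -0.1, 0.05, 0.02, -0.01, 0.005, 0.001]
c = expand_bandwidth(a)
```

Since the coefficients $c_i$ describe $A(z/\gamma)$, every zero of A(z) is scaled by γ toward the origin, which slightly widens the formant bandwidths.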
2.3. Codebook Parameter Searching 


2.3.1. Searching Problem 
To obtain the codebook parameters, the analysis 
searches for the codebook parameters minimizing the 
perceptual distortion 

$$\|e\|^2 = \|s - \hat{s}\|_w^2 = \|P_w (s - \hat{s})\|^2 \tag{14}$$

where ‖e‖ denotes the norm (or magnitude) of a vector, 
and $P_w$ represents a perceptual weighting filter defined 
as 

$$P_w(z) = \frac{H(z/\gamma)}{H(z)} \tag{15}$$

A typical γ is 0.8. (Such a $P_w(z)$ makes Eq. (14) a 
perceptual spectral-masking based measure rather than 
simply a pure Euclidean measure of waveform 
closeness). We call e the perceptual error vector. 

The codebook parameters affect the perceptual distortion 
in Eq. (14) through the excitation t and then $\hat{s}$. A 
codebook consists of prototypes or codewords b, which 
are arrays of impulses b[n]. Each codeword is indexed 
by a codebook entry called CBEntry. For each 
codebook, there is a gain table containing gain factors, 
which are real numbers. Each gain factor is indexed by 
a gain table entry called GainEntry. Thus for the ACB 
and SCB there are ACBEntry and SCBEntry, 
respectively, while for the ACB and SCB gain table 
entry there are ACBGainEntry and SCBGainEntry, 
respectively. 

A set of those entries produces t according to 

$$t = b^{(a)}(\text{ACBEntry})\, g^{(a)}(\text{ACBGainEntry}) + b^{(s)}(\text{SCBEntry})\, g^{(s)}(\text{SCBGainEntry}) \tag{16}$$

Here $b^{(a)}$(ACBEntry) and $b^{(s)}$(SCBEntry) are the ACB 
and SCB codewords pointed to by ACBEntry and 
SCBEntry, respectively, while $g^{(a)}$(ACBGainEntry) and 
$g^{(s)}$(SCBGainEntry) are the ACB and SCB gain factors 
pointed to by ACBGainEntry and SCBGainEntry, 
respectively. For a given s, the t produces $\hat{s}$ and then e 
according to Eq. (2) and Eq. (14), respectively. Thus the 
search problem becomes: for a given searching target s, 
find the ACBEntry, SCBEntry, ACBGainEntry, and 
SCBGainEntry corresponding to the e that minimizes Eq. 
(14). 
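
Eq. (16) is a gain-scaled sum of two codewords. The following sketch uses tiny made-up codebooks and gain tables (four-sample codewords instead of 60) purely to show the indexing; none of the values come from FS-1016.

```python
def excitation(acb, scb, acb_gains, scb_gains,
               acb_entry, acb_gain_entry, scb_entry, scb_gain_entry):
    """Eq. (16): t = b_a[ACBEntry] * g_a + b_s[SCBEntry] * g_s."""
    g_a = acb_gains[acb_gain_entry]
    g_s = scb_gains[scb_gain_entry]
    return [g_a * x + g_s * y
            for x, y in zip(acb[acb_entry], scb[scb_entry])]


# Toy codebooks and gain tables (illustration only).
acb = [[1.0, 0.0, 0.5, 0.0], [0.0, 1.0, 0.0, 0.5]]
scb = [[1, 0, -1, 0], [0, 0, 1, -1]]
acb_gains = [0.5, 1.0]
scb_gains = [-2.0, 2.0]

t = excitation(acb, scb, acb_gains, scb_gains,
               acb_entry=1, acb_gain_entry=0, scb_entry=0, scb_gain_entry=1)
```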

To solve the searching problem, there are several 
techniques such as those described in [KIKK90]. 
However, not all of them can be combined. We 
describe here fast searching algorithms that we actually 
use. Some are mandatory (implied by FS-1016), while 
some are our choice. We also discuss their 
consequences in the scheme. 


2.3.2. Breaking the Frames into Subframes 


One obvious way to reduce the computational cost of 
searching is to reduce the size of the codebooks, i.e., 
to reduce the number of prototypes in the codebook. 
However, this approach increases the vector 
quantization error. To reduce the quantization error, one 
should reduce the length (i.e., dimension) of the 
prototype. However, this increases the bit requirement 
because we need more prototypes to represent a segment 
of t. FS-1016 strikes this delicate balance by using a 
prototype length of 60 samples. This means that the 
searching target in one frame is split into four targets s 
in four subframes, and the scheme performs four searching 
processes to complete the encoding of one frame, resulting 
in four sets of codebook entries. The SCB size can then 
be reduced to as low as 512 while preserving natural 
speech quality. 

It should be noted that since the ACB is a codebook that 
actually represents a one-tap, adaptive, all-pole pitch 
filter [Lang92], its size is not determined this way. The 
ACB size determines the range of pitch frequency it can 
cover. For an excitation x[n], the filter produces 

$$y[n] = g\, y[n-d] + x[n] \tag{17}$$

with g as the filter coefficient (equivalent to the ACB 
gain) and d as the tap position (equivalent to the ACB 
entry). Varying d changes the pitch frequency (in Hz) 
according to 

$$\text{Pitch Frequency} = \frac{\text{Sampling Frequency}}{d} \tag{18}$$

FS-1016 covers pitch frequencies between 54 Hz and 400 
Hz, requiring d to be between 20 and 147. Thus, we use 
an ACB size of 128. FS-1016 actually provides a size 
option of 256 to improve the pitch resolution at high 
frequencies (associated with female speakers). It is clear 
from Eq. (18) that the pitch resolution at higher 
frequencies is coarser. The additional ACB entries are 
then added to improve the high-frequency resolution. 
To reduce the computational cost, we did not use this 
option. 
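
Eqs. (17) and (18) can be illustrated with a short sketch. The 8 kHz sampling frequency is an assumption consistent with the 54 Hz to 400 Hz range quoted above; the function names and the `history` argument (the filter memory from earlier samples) are hypothetical.

```python
def pitch_filter(x, g, d, history=None):
    """Eq. (17): y[n] = g * y[n-d] + x[n].

    `history` holds the d outputs preceding n = 0 (oldest first);
    zeros are assumed when it is omitted.
    """
    past = list(history)[-d:] if history is not None else [0.0] * d
    y = []
    for n, xn in enumerate(x):
        delayed = y[n - d] if n >= d else past[n]
        y.append(g * delayed + xn)
    return y


def pitch_frequency(sampling_hz, d):
    """Eq. (18): pitch frequency in Hz for tap position d."""
    return sampling_hz / d


impulse_response = pitch_filter([1.0, 0, 0, 0, 0, 0, 0], g=0.5, d=3)
highest = pitch_frequency(8000, 20)
lowest = pitch_frequency(8000, 147)
acb_size = 147 - 20 + 1
```

An impulse fed through the filter reappears every d samples, scaled by g each time, which is how the one-tap filter imposes quasi-periodicity on the excitation.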

The subframe search approach also enables a smoother 
transition of the LSP parameters through interpolation. 
Thus for each subframe i = 1, ..., 4, the scheme uses a 
different H(z) coming from interpolated LSP parameters 
defined as 

$$LSP_j^{(i)} = \frac{9 - 2i}{8}\, \text{Previous}_j + \frac{2i - 1}{8}\, \text{Present}_j \tag{19}$$

Thus the system must always keep the LSP parameters 
from the previous frame. 
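
The interpolation weights of Eq. (19) can be tabulated with a short sketch. The function names are hypothetical and the sample values are chosen only to make the weights visible.

```python
def interpolation_weights(subframe):
    """Weights of the previous and present frames for subframe i = 1..4."""
    return (9 - 2 * subframe) / 8, (2 * subframe - 1) / 8


def interpolate_lsp(previous, present, subframe):
    """Eq. (19) applied to every LSP value of a frame."""
    w_prev, w_pres = interpolation_weights(subframe)
    return [w_prev * a + w_pres * b for a, b in zip(previous, present)]


weights = [interpolation_weights(i) for i in range(1, 5)]
first = interpolate_lsp([0.0, 8.0], [8.0, 16.0], 1)
last = interpolate_lsp([0.0, 8.0], [8.0, 16.0], 4)
```

The weights move from 7/8 on the previous frame in the first subframe to 7/8 on the present frame in the fourth, and always sum to one.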


2.3.3. Combining Perceptual and Spectral Filters 


We can reduce the computational cost by reducing the 
number of filters used during the search. To compute 
the perceptual distortion in Eq. (14), each prototype 
must pass through the LPC filter and the perceptual 
weighting filter. In the z-domain, the perceptual 
distortion vector is 

$$\begin{aligned}
E(z) &= P_w(z)\,\{S(z) - \hat{S}(z)\} \\
     &= P_w(z) S(z) - P_w(z) H(z) T(z) \\
     &= Y(z) - W(z) T(z) \\
     &= Y(z) - X(z)
\end{aligned} \tag{20}$$

where 

$$Y(z) = P_w(z)\, S(z) \tag{21}$$

$$W(z) = P_w(z)\, H(z) = H(z/\gamma) \tag{22}$$

$$X(z) = W(z)\, T(z) \tag{23}$$

Observe that only one filtering operation, W(z), is now 
required (i.e., Eq. (23)) for every prototype. As a new 
searching target, Y(z) is calculated only once using Eq. 
(21), and then the search minimizes (in vector 
notation) 

$$\|e\|^2 = \|y - x\|^2 \tag{24}$$

There is a slight problem with this approach if we 
calculate Eq. (23) using vector and matrix operations. In 
matrix form, filter W(z) is approximated by a 60 x 60 
matrix W defined as 

$$W = \begin{bmatrix}
w[0] & 0 & \cdots & 0 \\
w[1] & w[0] & \cdots & 0 \\
\vdots & \vdots & \ddots & \vdots \\
w[59] & w[58] & \cdots & w[0]
\end{bmatrix} \tag{25}$$

where w[i] is the impulse response of W(z), such that 

$$x = W t \tag{26}$$

Unfortunately, the search results are good only if the 
CELP synthesizer also uses H(z) in matrix form, 
which is not the case. Let z be the zero-input response of 
H(z) at the synthesizer, i.e., z[n] is the output of H(z) 
when its input is zero for the whole subframe. In practice, 
z is not zero due to the non-zero contents of the H(z) delay 
elements, resulting from the previous excitation. Thus, 
the actual output of the synthesizer is 

$$\hat{s} = H t + z \tag{27}$$

The analyzer must then introduce a compensation 
scheme such that we minimize Eq. (14) but still use the 
combined filter W with Eq. (26). From Eq. (27) we have 

$$P_w \hat{s} = P_w H t + P_w z = x + P_w z \tag{28}$$

Using the derivation in Eq. (20), we have 

$$\begin{aligned}
\|e\|^2 &= \|P_w s - x - P_w z\|^2 \\
        &= \|P_w (s - z) - x\|^2 \\
        &= \|y - x\|^2
\end{aligned} \tag{29}$$

Now, y is the new searching target, defined as 

$$y = P_w (s - z) \tag{30}$$

Let e(CBEntry, GainEntry) be the perceptual error 
vector corresponding to a codebook entry CBEntry and 
a gain entry GainEntry. Clearly, minimizing 

$$\|e(\text{CBEntry}, \text{GainEntry})\|^2 = \|y - x(\text{CBEntry}, \text{GainEntry})\|^2 \tag{31}$$

is equivalent to minimizing Eq. (14), with z taken 
into account. Figure 2 shows the new structure. 
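
The role of the zero-input response in Eq. (27) can be checked numerically. The sketch below uses a second-order all-pole filter and an eight-sample subframe instead of the tenth-order filter and 60-sample subframe of the text; the coefficients, excitation, and filter memory are made-up values, and the function names are hypothetical.

```python
def all_pole_filter(a, t, state=None):
    """out[n] = t[n] - sum_i a_i * out[n-i]; `state` holds earlier outputs,
    most recent last (zeros if omitted)."""
    order = len(a)
    mem = list(state) if state is not None else [0.0] * order
    out = []
    for tn in t:
        yn = tn - sum(a[i] * mem[-1 - i] for i in range(order))
        out.append(yn)
        mem.append(yn)
    return out


def lower_toeplitz(h):
    """Matrix form of a causal filter, laid out as in Eq. (25)."""
    n = len(h)
    return [[h[i - j] if i >= j else 0.0 for j in range(n)] for i in range(n)]


def matvec(m, v):
    return [sum(row[j] * v[j] for j in range(len(v))) for row in m]


a = [-0.9, 0.2]                       # poles at 0.5 and 0.4 (stable)
t = [1.0, -0.5, 0.25, 0.0, 0.75, -1.0, 0.5, 0.1]
state = [0.3, -0.1]                   # memory left by the previous subframe
n = len(t)

h = all_pole_filter(a, [1.0] + [0.0] * (n - 1))     # impulse response
zero_state = matvec(lower_toeplitz(h), t)           # matrix form, H t
zero_input = all_pole_filter(a, [0.0] * n, state)   # z in Eq. (27)
actual = all_pole_filter(a, t, state)               # what a synthesizer outputs
```

The matrix product alone reproduces the filter output only when the filter memory is zero; adding the zero-input response recovers the actual output, which is why z must be subtracted from the target in Eq. (30).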


2.3.4. Serial Search 


To further reduce the computational cost, the scheme 
serially searches the ACB parameters before the SCB 
parameters. The system uses 512 and 128 entries for the 
SCB and ACB, respectively, and 16 entries for each gain 
table. If the scheme had to search all codebooks 
simultaneously, it would have to search through 
512x128x16x16 = 16,777,216 entries. On the other 
hand, the serial search works on only 512x16 + 128x16 = 
10,240 trials. 
Fig. 2. Practical CELP analyzer. 

Consequently, the ACB and SCB searches differ in their 
searching targets. The searching target of the ACB is y as 
defined in Eq. (30). The resulting ACB parameters 
alone can produce x according to Eq. (26), but they 
result in a high $\|e\|^2$. The SCB parameters must then 
generate a signal that 'fills the gap' between y and such 
an x. Thus, y − Wt becomes the SCB searching target, 
where t is obtained from Eq. (16) using the newly obtained 
ACB parameters but without the SCB parameters. 


2.3.5. Joint Optimization Search 


A joint optimization scheme suboptimally searches for 
codebook and gain entries in one process, thus further 
reducing the number of prototype trials. In minimizing 
$\|e(\text{CBEntry}, \text{GainEntry})\|^2$, the system should search 
through all combinations of CBEntry and GainEntry. 
However, the joint optimization scheme assigns an 
optimal GainEntry to each CBEntry, so that the scheme 
effectively searches for CBEntry only. In other words, 
instead of searching through 10,240 entries, the scheme 
only needs to search through 512+128 = 640 entries. 
This suboptimal solution saves an order of magnitude 
in computation. The basic approach is as follows. 
1. For every codebook entry CBEntry, compute 
v (sometimes called the normalized x, i.e., the x 
obtained with unit gain), according to 

$$v = W\, b[\text{CBEntry}] \tag{32}$$

Here, b[CBEntry] is the prototype in the codebook 
pointed to by CBEntry. This process is often called 
convolution. 

2. For every CBEntry, compute the GainEntry associated 
with the CBEntry. Suppose g is the gain value which 
scales b to become t. Clearly, 

$$x = g v \tag{33}$$

One way to minimize Eq. (31) is to maximize a Peak 
value defined in inner-product terms as 

$$\text{Peak} = 2\langle y, x\rangle - \langle x, x\rangle = 2g\langle y, v\rangle - g^2\langle v, v\rangle \tag{34}$$

To find the best g to maximize Eq. (34), we take the 
derivative of Eq. (34) with respect to g, and find its 
root. The root, which is the best g, turns out to be 

$$g = \frac{\langle y, v\rangle}{\langle v, v\rangle} \tag{35}$$

Furthermore, GainEntry is now the index whose 
value in the gain table is closest to this g. 

3. For every CBEntry, also compute the Peak value 
using Eq. (34). 

4. Find the CBEntry that has the closest distance, that is, 
the one with the highest Peak value. This CBEntry and 
its associated GainEntry become the desired codebook 
parameters. 

Notice that there are three main computational 
processes: the convolution to obtain v and the two inner 
products ⟨y, v⟩ and ⟨v, v⟩. They are called many times, 
as many as the codebook size. Consequently, they are 
the bottleneck of the system. 
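
The four steps above can be sketched for a single codebook as follows. The codebook, gain table, impulse response, and target are small random stand-ins (six-sample vectors instead of 60), the Peak value is evaluated with the quantized gain, and all names are hypothetical. This is a sketch of the idea under those assumptions, not the authors' implementation.

```python
import random


def inner(u, v):
    return sum(x * y for x, y in zip(u, v))


def convolve_causal(w, b):
    """v = W b with W lower-triangular Toeplitz as in Eq. (25)."""
    return [sum(w[i - j] * b[j] for j in range(i + 1)) for i in range(len(b))]


def joint_search(y, w, codebook, gains):
    """Return (peak, cb_entry, gain_entry) with the highest Peak value."""
    best = None
    for entry, b in enumerate(codebook):
        v = convolve_causal(w, b)                       # step 1, Eq. (32)
        yv, vv = inner(y, v), inner(v, v)
        if vv == 0.0:
            continue                                    # all-zero codeword
        g_opt = yv / vv                                 # step 2, Eq. (35)
        gain_entry = min(range(len(gains)),
                         key=lambda k: abs(gains[k] - g_opt))
        g = gains[gain_entry]
        peak = 2.0 * g * yv - g * g * vv                # step 3, Eq. (34)
        if best is None or peak > best[0]:              # step 4
            best = (peak, entry, gain_entry)
    return best


rng = random.Random(1)
length = 6
w = [0.8 ** n for n in range(length)]                   # toy impulse response
codebook = [[rng.choice([-1, 0, 1]) for _ in range(length)] for _ in range(8)]
gains = [-1.0, -0.5, -0.1, 0.1, 0.5, 1.0, 2.0]
y = [rng.gauss(0.0, 1.0) for _ in range(length)]

peak, cb_entry, gain_entry = joint_search(y, w, codebook, gains)
error = inner(y, y) - peak                              # ||y - x||^2 of the winner
```

Because the error is a parabola in g for a fixed codeword, picking the table gain nearest to the root of Eq. (35) needs no loop over the gain table, which is where the saving in trials comes from.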


2.3.6. Fast Convolution with Special Codebooks 


The search scheme employs a fast convolution 
algorithm for the convolution in Eq. (32) by exploiting 
the overlapping property of the codebook elements 
[KIKK90]. As a result, some of the convolution results 
of an entry can be used to compute the convolution of the 
next entry. Let us design an SCB such that all the 
prototypes' elements come from an array r having 1082 
elements. Suppose the elements of a prototype pointed to 
by CBEntry (i.e., b(CBEntry)) are $b_{CBEntry}[i]$ with 
i = 0, ..., 59. Then we force the elements to be 

$$b_{CBEntry}[i] = r[2(511 - \text{CBEntry}) + i] \tag{36}$$

It can be verified that the prototypes are overlapping, 
i.e., most elements of a prototype are also elements of 
another prototype in its neighborhood. 


With this special SCB, we can obtain $v_{CBEntry}[i]$ 
using Eq. (32) as follows 

$$v_{CBEntry}[i] = \sum_{j=0}^{59} w[i-j]\, b_{CBEntry}[j] \tag{37}$$

with w[k] = 0 for k < 0, as in Eq. (25). To simplify the 
notation, define u(CBEntry,i,j) as 

$$u(\text{CBEntry}, i, j) = w[i-j]\, b_{CBEntry}[j] = w[i-j]\, r[2(511 - \text{CBEntry}) + j] \tag{38}$$

We then have 

$$v_{CBEntry}[i] = \sum_{j=0}^{1} u(\text{CBEntry}, i, j) + \sum_{j=2}^{61} u(\text{CBEntry}, i, j) - \sum_{j=60}^{61} u(\text{CBEntry}, i, j) \tag{39}$$

i.e., head term + middle term − tail term. 

It can be verified easily that the middle term is exactly 
$v_{CBEntry-1}[i]$ because of the overlapping property. 

This remarkable fact leads to a fast iteration for 
convolution. Now, instead of performing 60 
multiply-and-accumulate (MAC) operations as implied 
by Eq. (37), the scheme calculates each $v_{CBEntry}[i]$ in 
only 4 MACs to obtain the head and tail terms, and uses 
the previously calculated $v_{CBEntry-1}[i]$ as the 
middle term. The computational cost is reduced by a 
factor of 15. 

We can even avoid computing the tail term if we can afford a long array y' of length 1082 (60 + 2 x 511) and a short array y'' of length 60, as shown in the following modified joint-optimization algorithm.

1. Compute $y_0[i]$ using the old method (Eq. (37)) as the starting point for the iteration. Store the results in an initially empty y' according to

$$
y'[i] = y_0[i]; \quad i = 0, \ldots, 59 \tag{40}
$$

2. Calculate Peak and GainEntry as in the joint optimization, and store them in BestPeak and BestGainEntry, respectively. Also store CBEntry (0 in this case) in BestCBEntry.

Then for every CBEntry = 1, ..., 511, perform:

3. Calculate the 60 head terms and store them in y''.

4. Update the array y' according to

$$
y'[i + 2\,CBEntry] \leftarrow y'[i + 2\,CBEntry] + y''[i] \tag{41}
$$

5. Calculate Peak and GainEntry as in the joint optimization. However, get y from y' according to

$$
y[i] = y'[i + 2\,CBEntry]; \quad i = 0, \ldots, 59 \tag{42}
$$

6. Compare Peak with a variable BestPeak (predefined as zero). If the current Peak is larger than BestPeak, the scheme updates BestPeak with Peak, and stores CBEntry and GainEntry in BestCBEntry and BestGainEntry, respectively.

After performing those steps for all entries, the desired parameters are available in BestCBEntry and BestGainEntry.
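The steps above can be sketched in C as follows. This is a minimal sketch under stated assumptions, not the routine used in the paper. It assumes a causal impulse response h[], so that y[i] is the sum over j <= i of h[i-j] b[j]; with that convention the window into the long array moves with a decreasing offset rather than the increasing index i + 2 CBEntry of Eq. (41). It also assumes that Peak is the squared correlation with the target divided by the energy, and that the gain is their ratio. The names filter_direct, search_overlapped, SUBFRAME, and SHIFT are illustrative.

```c
#include <stdlib.h>

#define SUBFRAME 60  /* samples per subframe */
#define SHIFT    2   /* neighboring codewords differ by a 2-sample shift */

/* "Old method": full truncated convolution of one codeword b with h. */
void filter_direct(const float *h, const float *b, float *y)
{
    for (int i = 0; i < SUBFRAME; i++) {
        float acc = 0.0f;
        for (int j = 0; j <= i; j++)
            acc += h[i - j] * b[j];
        y[i] = acc;
    }
}

/*
 * Search an overlapped codebook stored in r[], where codeword k starts at
 * r[SHIFT * (num_entries - 1 - k)].  Returns the best entry (or -1 if no
 * entry has a positive score or allocation fails) and writes its gain.
 */
int search_overlapped(const float *r, int num_entries, const float *h,
                      const float *target, float *best_gain)
{
    size_t len = SUBFRAME + (size_t)SHIFT * (size_t)(num_entries - 1);
    float *ylong = calloc(len, sizeof *ylong);   /* long array, zeroed */
    float best_peak = 0.0f;
    int best_entry = -1;

    if (ylong == NULL)
        return -1;

    for (int k = 0; k < num_entries; k++) {
        int off = SHIFT * (num_entries - 1 - k);
        const float *b = r + off;
        /* For k > 0, y[i] already holds y_{k-1}[i - SHIFT] for
         * i >= SHIFT and zero for i < SHIFT: this is the middle term. */
        float *y = ylong + off;
        int heads = (k == 0) ? SUBFRAME : SHIFT;  /* entry 0: all terms */

        /* Add the head terms in place; the tail falls outside the window. */
        for (int j = 0; j < heads; j++)
            for (int i = j; i < SUBFRAME; i++)
                y[i] += h[i - j] * b[j];

        float corr = 0.0f, energy = 0.0f;
        for (int i = 0; i < SUBFRAME; i++) {
            corr += target[i] * y[i];
            energy += y[i] * y[i];
        }
        if (energy > 0.0f && corr * corr / energy > best_peak) {
            best_peak = corr * corr / energy;
            best_entry = k;
            *best_gain = corr / energy;
        }
    }
    free(ylong);
    return best_entry;
}
```

Because the window for each entry simply stops 60 samples after its start, the two samples that would form the tail term are never read, which is why no subtraction appears in the loop.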

Further cost reduction is due to the fact that the FS-1016 SCB uses $b_{CBEntry}[i]$ that is not only overlapping but also sparse (77% of the elements are 0) and ternary (i.e., the elements take the values -1, 0, and 1 only). Thus, before calculating the head terms in Step 3 above, the scheme checks whether $b_{CBEntry}[j]$ is zero. In such cases, the 60 computations of the term using this $b_{CBEntry}[j]$ in Step 3 are skipped. The scheme should have 77% of such cases.

With ternary $b_{CBEntry}[i]$, multiplications in computing the head terms are no longer necessary, because multiplication by 1 or -1 is equivalent to changing the sign only.
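A minimal C sketch of a head-term update that exploits both properties is given below. It assumes a causal impulse response h[] and an output buffer y[] that already contains the middle term; the function name and the use of signed char for the codebook samples are illustrative choices, not taken from the paper.

```c
/*
 * Add the head terms of a ternary codeword to y[0..n-1] in place.
 * b[j] is -1, 0, or +1; h[] is the (assumed causal) impulse response.
 */
void add_head_terms_ternary(const signed char *b, int num_heads,
                            const float *h, float *y, int n)
{
    for (int j = 0; j < num_heads; j++) {
        if (b[j] == 0)
            continue;                 /* sparse: skip every term using b[j] */
        if (b[j] > 0) {
            for (int i = j; i < n; i++)
                y[i] += h[i - j];     /* times +1: plain addition */
        } else {
            for (int i = j; i < n; i++)
                y[i] -= h[i - j];     /* times -1: plain subtraction */
        }
    }
}
```

The inner loops contain no multiplication at all, and the branch is taken once per codebook sample rather than once per output sample.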

Although the above example is derived for the SCB search, the ACB search can also use fast convolution. Since the ACB is actually a one-tap, all-pole filter, the overlapping property is inherent in the ACB. However, the ACB elements are neither ternary nor sparse, so neither the calculation of the head terms nor the multiplications can be omitted. Nevertheless, the calculation is already fast, because the number of MACs in its head term is only one (except in some special cases at the lower entries), instead of two as in the SCB. Furthermore, the size of the ACB we use is 128, as opposed to 512 for the SCB.

It should be clear that this fast convolution works only if we use W(z) in a matrix form; otherwise we cannot have Eq. (37) and the rest of its derivations.


2.3.7. Delta Coding for ACB Parameters 


Further computation reduction is possible for the ACB search. Here we utilize the fact that human pitch does not change suddenly within two subframes (15 ms). This means we expect the difference between the selected codebook entries of consecutive subframes to be less than 64 entries. Thus we can employ delta coding, which codes the entry difference only. Such coding needs a reference point. The FS-1016 uses the ACB entries of odd subframes as the references and delta codes the even subframes, i.e., the entry of the second or the fourth subframe is represented by the difference between the actual entry and the previous-subframe entry. This scheme reduces the computation because the even search routine operates on a subset of the ACB only (64 entries instead of 128 entries). This scheme also reduces the bit rate, since fewer bits are needed to represent the difference than to represent the actual entry.
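The idea can be illustrated with a short C sketch. The placement of the 64-entry window is an assumption made here for illustration (centered on the previous entry and clamped to the codebook range); FS-1016 defines its own mapping, which this sketch does not claim to reproduce. All names are hypothetical.

```c
#define ACB_SIZE    128  /* entries in the adaptive codebook */
#define DELTA_RANGE 64   /* entries searched in even subframes */

/* First entry of the search window around the previous (odd) entry.
 * Assumed placement: centered on prev_entry, clamped to [0, 127]. */
int delta_window_start(int prev_entry)
{
    int start = prev_entry - DELTA_RANGE / 2;
    if (start < 0)
        start = 0;
    if (start > ACB_SIZE - DELTA_RANGE)
        start = ACB_SIZE - DELTA_RANGE;
    return start;
}

/* Encode an even-subframe entry as an offset 0..63 inside the window.
 * The even search only visits window entries, so the offset is in range. */
int delta_encode(int prev_entry, int entry)
{
    return entry - delta_window_start(prev_entry);
}

/* Recover the actual entry from the offset and the previous entry. */
int delta_decode(int prev_entry, int code)
{
    return delta_window_start(prev_entry) + code;
}
```

With 64 possible offsets, the code fits in 6 bits, compared with 7 bits for a full 128-entry index.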


2.4. Speech Synthesis 


2.4.1. Synthesis Process 


A CELP synthesizer reconstructs 240 samples of ŝ from a set of CELP parameters. In principle, the synthesizer must first construct the filter H(z) using the interpolated LSP parameters. The synthesizer then computes the excitation impulses t for one subframe using the codebook parameters according to Eq. (17). Finally, it applies the excitation impulses t to the filter H(z) to synthesize a 60-element speech segment ŝ using Eq. (2). Repeating the process three more times results in a complete 240 elements of ŝ.
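The last step of this process can be sketched in C as follows. The sketch assumes that H(z) = 1/A(z) with A(z) = 1 + a[1] z^-1 + ... + a[10] z^-10; the sign convention of Eq. (2) may differ. The array state[] plays the role of the filter delay elements and is carried from one subframe to the next. The function and macro names are illustrative.

```c
#define LPC_ORDER 10
#define SUBFRAME  60

/*
 * Apply excitation t[0..59] to the all-pole filter 1/A(z).
 * state[k-1] holds the output delayed by k samples; it is updated so that
 * the next subframe continues where this one stopped.
 */
void synthesize_subframe(const float a[LPC_ORDER + 1], const float *t,
                         float *s, float state[LPC_ORDER])
{
    for (int n = 0; n < SUBFRAME; n++) {
        float acc = t[n];
        for (int k = 1; k <= LPC_ORDER; k++)
            acc -= a[k] * state[k - 1];

        /* Shift the delay line and insert the newest output sample. */
        for (int k = LPC_ORDER - 1; k > 0; k--)
            state[k] = state[k - 1];
        state[0] = acc;
        s[n] = acc;
    }
}
```

Calling the function four times with the same state[] array, once per subframe, yields the 240 samples of one frame.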

Since most of the steps have been explained, we describe here only the conversion of LSP to LPC parameters.


2.4.2. LSP to LPC Conversion 


We want to reconstruct a from the interpolated LSPs $\omega_i$. Let us define xp and xq according to

$$
xp_i = \omega_{2i}, \qquad xq_i = \omega_{2i+1}; \qquad i = 0, \ldots, 4 \tag{43}
$$


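The following steps then convert the LSP:

1. Recover the array b as in Eq. (10) according to the following equations

$$
\begin{aligned}
b_5 &= 32 \\
b_4 &= -b_5 \sum_{i=0}^{4} xp_i \\
b_3 &= b_5 \sum_{i=0}^{3} \sum_{j=i+1}^{4} xp_i\, xp_j \\
b_2 &= -b_5 \sum_{i=0}^{2} \sum_{j=i+1}^{3} \sum_{m=j+1}^{4} xp_i\, xp_j\, xp_m \\
b_1 &= b_5 \sum_{i=0}^{1} \sum_{j=i+1}^{2} \sum_{m=j+1}^{3} \sum_{n=m+1}^{4} xp_i\, xp_j\, xp_m\, xp_n \\
b_0 &= -b_5 \left( xp_0\, xp_1\, xp_2\, xp_3\, xp_4 \right)
\end{aligned}
\tag{44}
$$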

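2. Recover the coefficients of p(z) according to the following equations

$$
\begin{aligned}
p_1 &= \frac{b_4}{16} \\
p_2 &= \frac{b_3 + 40}{8} \\
p_3 &= \frac{b_2 + 16 p_1}{4} \\
p_4 &= \frac{b_1 + 6 p_2 - 10}{2} \\
p_5 &= b_0 + 2 p_3 - 2 p_1
\end{aligned}
\tag{45}
$$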


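3. Recover the array c as in Eq. (11) according to the following equations

$$
\begin{aligned}
c_5 &= 32 \\
c_4 &= -c_5 \sum_{i=0}^{4} xq_i \\
c_3 &= c_5 \sum_{i=0}^{3} \sum_{j=i+1}^{4} xq_i\, xq_j \\
c_2 &= -c_5 \sum_{i=0}^{2} \sum_{j=i+1}^{3} \sum_{m=j+1}^{4} xq_i\, xq_j\, xq_m \\
c_1 &= c_5 \sum_{i=0}^{1} \sum_{j=i+1}^{2} \sum_{m=j+1}^{3} \sum_{n=m+1}^{4} xq_i\, xq_j\, xq_m\, xq_n \\
c_0 &= -c_5 \left( xq_0\, xq_1\, xq_2\, xq_3\, xq_4 \right)
\end{aligned}
\tag{46}
$$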


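4. Obtain the set of q(z) coefficients according to the following equations

$$
\begin{aligned}
q_1 &= \frac{c_4}{16} \\
q_2 &= \frac{c_3 + 40}{8} \\
q_3 &= \frac{c_2 + 16 q_1}{4} \\
q_4 &= \frac{c_1 + 6 q_2 - 10}{2} \\
q_5 &= c_0 + 2 q_3 - 2 q_1
\end{aligned}
\tag{47}
$$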


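5. Finally, use p and q to construct a by inverting Eq. (12), as follows

$$
\begin{aligned}
p_0 &= q_0 = 1 \\
a_i &= \frac{p_i + p_{i-1} + q_i - q_{i-1}}{2}; \quad i = 1, \ldots, 5 \\
a_{11-i} &= \frac{p_i + p_{i-1} - q_i + q_{i-1}}{2}; \quad i = 1, \ldots, 5
\end{aligned}
\tag{48}
$$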
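The five steps can be sketched in C as follows. This is a minimal sketch under assumptions, not the ConvertLSPtoLPC routine of the paper. It takes the values xp and xq of Eq. (43) as inputs, builds the power-series coefficients by multiplying out one factor (x - root) at a time (which yields the same alternating sums of products as Steps 1 and 3), and assumes a[0] = 1. The function names and the use of double precision are illustrative.

```c
#define NROOT 5   /* roots per polynomial (10th-order LPC) */

/*
 * Steps 1-2 (or 3-4): expand 32 * prod(x - root[i]) into powers of x,
 * then convert the power coefficients into p[0..5] with p[0] = 1.
 */
void roots_to_series(const double root[NROOT], double p[NROOT + 1])
{
    double b[NROOT + 1] = {32.0, 0.0, 0.0, 0.0, 0.0, 0.0};
    int deg = 0;

    /* Multiply by (x - root[i]) one factor at a time; b[k] scales x^k. */
    for (int i = 0; i < NROOT; i++) {
        for (int k = deg + 1; k >= 1; k--)
            b[k] = b[k - 1] - root[i] * b[k];
        b[0] = -root[i] * b[0];
        deg++;
    }

    p[0] = 1.0;
    p[1] = b[4] / 16.0;
    p[2] = (b[3] + 40.0) / 8.0;
    p[3] = (b[2] + 16.0 * p[1]) / 4.0;
    p[4] = (b[1] + 6.0 * p[2] - 10.0) / 2.0;
    p[5] = b[0] + 2.0 * p[3] - 2.0 * p[1];
}

/* Step 5: combine the two series into a[0..10] (a[0] = 1 is assumed). */
void lsp_roots_to_lpc(const double xp[NROOT], const double xq[NROOT],
                      double a[2 * NROOT + 1])
{
    double p[NROOT + 1], q[NROOT + 1];

    roots_to_series(xp, p);
    roots_to_series(xq, q);

    a[0] = 1.0;
    for (int i = 1; i <= NROOT; i++) {
        a[i] = 0.5 * (p[i] + p[i - 1] + q[i] - q[i - 1]);
        a[2 * NROOT + 1 - i] = 0.5 * (p[i] + p[i - 1] - q[i] + q[i - 1]);
    }
}
```

The conversion needs no trigonometric evaluation or root finding, only a fixed number of multiplications and additions.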


3. COMPUTER IMPLEMENTATION 


We can now translate the above algorithm into a computer implementation. We have coded the procedures as ANSI C routines. We briefly describe the actual program to show how a CELP system uses the procedures. Details of the routines for codebook searching are presented in [GrLK93].


3.1. LSP Analysis 


First, a routine PCMtoFloat converts the speech samples s into floating-point form, since s usually comes from an analog-to-digital converter with an integer data format, while floating-point computation is preferred to reduce the distortion caused by finite-length registers. A routine AnalyzeLPC then extracts a from s, as explained in Section 2.2.1. Prior to converting a to LSPs, the scheme calls a routine ExpandBandwidth to expand the bandwidth of a_i using Eq. (13) with an expanding factor γ of 0.994. This procedure ensures that a are within the range of the LSP table. The scheme calls the ConvertLPCtoLSP routine to obtain LSP_i from a, according to Section 2.2.2. Finally, a routine CheckLSPStability verifies the monotonicity of the LSP_i before allowing them to be used (see section


3.2. Codebook Searching


A computer routine called CodebookSearching finds the ACB and SCB parameters. First, we must construct a 60x60 matrix W representing W(z) (see Eq. (25)). The scheme starts by obtaining a. An InterpolateLSP routine provides the LSPs for each individual subframe by interpolation using Eq. (19). In practice, the filter W requires a instead of LSPs, so the scheme calls the ConvertLSPtoLPC routine for the conversion (see Section 2.4.2). An ExpandBandwidth routine then performs Eq. (13) to generate c, with an expanding factor γ of 0.8. It is easy to show using Eq. (22) that W(z) is equivalent to H(z) with c replacing a. Furthermore, to represent the filter W(z), W must contain the impulse responses, w_i, of the filter, as shown in Eq. (22). A FindImpulseResponse routine provides these elements.
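The two routines named last can be sketched in C as follows. The sketch assumes that Eq. (13) is the usual bandwidth-expansion scaling c[i] = γ^i a[i], and that the filter is all-pole with denominator 1 + c[1] z^-1 + ... + c[10] z^-10; both are assumptions, since the sign convention of the paper may differ. The lower-case function names are illustrative stand-ins for ExpandBandwidth and FindImpulseResponse.

```c
#define LPC_ORDER 10

/* Bandwidth expansion (assumed form of Eq. (13)): c[i] = gamma^i * a[i]. */
void expand_bandwidth(const float a[LPC_ORDER + 1], float gamma,
                      float c[LPC_ORDER + 1])
{
    float g = 1.0f;
    for (int i = 0; i <= LPC_ORDER; i++) {
        c[i] = a[i] * g;
        g *= gamma;
    }
}

/* First n samples of the impulse response of the all-pole filter 1/C(z). */
void find_impulse_response(const float c[LPC_ORDER + 1], float *w, int n)
{
    for (int i = 0; i < n; i++) {
        float acc = (i == 0) ? 1.0f : 0.0f;   /* unit impulse input */
        for (int k = 1; k <= LPC_ORDER && k <= i; k++)
            acc -= c[k] * w[i - k];
        w[i] = acc;
    }
}
```

With n = 60, the resulting samples w[0..59] are the values from which the 60x60 matrix W is filled.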

Having constructed W, the scheme prepares for ACB parameter searching. The FindACBSearchingTarget routine determines the ACB searching target y for the current subframe using Eq. (30).

If the subframe is odd, i.e., the first or third subframe, the scheme calls the ACBSearchingOdd routine to get the ACB parameters; otherwise the ACBSearchingEven routine performs that function. The scheme then computes the target for the SCB search by calling FindSCBSearchingTarget. The SCBSearching routine obtains the SCB parameters and stores them in an output buffer. The routine now has a complete set of the codebook parameters.

Before the loop proceeds to the next subframe, it must prepare and update the system states. First, an UpdateACB routine updates the contents of the ACB with new values from the excitation impulses to imitate the effects of the delay elements in an all-pole, one-tap pitch filter. Second, the delay elements of H(z) must also be updated according to those of the CELP synthesizer. At this phase, the synthesizer has stored values in its delay elements, which have an additive effect on the synthetic speech produced in the next subframe. The GetDelayElements routine tracks those values, which are later used by the next-subframe FindACBSearchingTarget to compensate for the additive effect represented by the zero response, as discussed in Section 2.3.3. Finally, if the subframe is even, i.e., the second or fourth subframe, the DeltaEncodingACB routine encodes the ACB entry using a delta coder.

Having obtained the LSPs and all entries for ACB and SCB for one frame, the scheme collects them in an FS-1016 data stream for transmission, by calling ConvertToDataStream. A routine UpdatePreviousLSP updates the contents of the previous LSPs with the newly obtained LSPs, to be used for interpolation (using Eq. (19)) and stability checking in the next frame.
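How the previous LSPs enter the next frame can be sketched in C as follows. The sketch assumes that Eq. (19) is a linear blend of the previous-frame and current-frame LSPs with one weight per subframe (the actual weights are not reproduced here), and that stability checking means testing that the LSPs are strictly increasing. The function names are illustrative.

```c
#define NUM_LSP 10

/* Return 1 if the LSPs are strictly increasing, 0 otherwise. */
int lsp_is_monotonic(const float lsp[NUM_LSP])
{
    for (int i = 1; i < NUM_LSP; i++)
        if (lsp[i] <= lsp[i - 1])
            return 0;
    return 1;
}

/*
 * Blend previous-frame and current-frame LSPs for one subframe.
 * weight is the share of the current frame, between 0 and 1.
 */
void interpolate_lsp(const float prev[NUM_LSP], const float cur[NUM_LSP],
                     float weight, float out[NUM_LSP])
{
    for (int i = 0; i < NUM_LSP; i++)
        out[i] = (1.0f - weight) * prev[i] + weight * cur[i];
}
```

Both operations need the LSPs of the preceding frame, which is why they are saved at the end of each frame.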


3.3. Speech Synthesis 


A synthesis program converts each FS-1016 data stream into a frame of speech. First, a routine ConvertFromStream unpacks the LSPs and the entries of ACB and SCB from the data stream. Since two of the ACB entries are delta coded, a routine DeltaDecoding obtains the actual entries.

As in the case of codebook searching, the synthesis performs a loop for four consecutive subframes. The loop starts with InterpolateLSP to obtain a smooth transition of the LPC filter H(z), using Eq. (19). A routine ConvertLSPtoLPC provides a from the LSPs to construct H(z) (see Section 2.4.2). To get the excitation impulses t, the loop calls UpdateACB, which computes t using the ACB and SCB entries, and also updates the ACB using the resulting t. Finally, a routine GetDelayElements applies t to H(z) to produce the speech and, at the same time, updates the delay elements of H(z) to be used later for the next H(z).

Before the process continues to the next frame, a routine UpdatePreviousLSP updates the contents of the previous LSPs with the current LSPs.


4. PERFORMANCE 


The algorithm presented here is fast enough for practical uses, such as store-and-forward communication, voice mail, and multimedia. We have ported the computer program to various platforms, including the TMS320C30, IBM PC, SUN workstations, and an IBM PowerPC-based workstation. It has also been ported as a dynamic link library (DLL) for Windows 3.1, ready to be used for various speech applications. Figure 3 shows a simple CELP compression application as an example of accessing the DLL.


[Figure 3: screenshot of a dialog titled "CELP Speech Compression" with Compress and Decompress options.]

Fig. 3. A simple Windows 3.1 CELP system utilizing the CELP dynamic link library.


Table 1 shows that the execution time is within a reasonable range. On the IBM Power PC workstation, the algorithm runs faster than real-time (0.85 real-time for both analysis and synthesis). The execution time of the C30 implementation is approximately two to three


the routines require between five times and twice the real-time requirement. MHz IBM-PC 486DX requires approximately 14 times real-time.


up the codebook searching, the synthesized speech still has high intelligibility and natural quality [Lang92]. The results from a Fairbanks rhyme test show an intelligibility score of more than 95% correct word identification. Furthermore, subjective and objective tests using male-spoken Harvard sentences result in a mean opinion score (MOS) of 3.21 and a segmental signal-to-noise ratio (SEGSNR) of 10.10 dB, respectively.


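Table 1. [...] platforms, in terms of % real-time.

[Columns: Analysis Time (% real-time); Synthesis (% real-time). Surviving row labels: SUN Sparc 5; PC-AT/TMS C30; PowerPC (Power 590). The numeric entries are not recoverable.]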


Furthermore, the size of the executable file is small. The C30 program and data require less than 11 Kwords of memory. The size of the SUN version executable file is 64 Kbytes. The algorithm can be coded modularly in C to enable tailoring it to another application.

However, the fast algorithm is quite complex, i.e., it involves many processes, loops, and variables. The efforts to reduce the computation time result in an increased memory requirement to hold look-up tables and codebooks. The algorithm also reduces the overhead in data transfers by fixing the locations of arrays and using them globally. This increases the complexity, because data may be altered by several different processes, which means that many processes must be considered simultaneously.


On most platforms, a real-time application still requires faster processors. Our observation of the C30 program reveals that the codebook searching consumes 218% of the real-time requirement, i.e., 2.18 seconds of codebook searching are required for every second of speech. As shown in Table 2, this results from the inner products inside the joint optimization scheme, which consume 111% of the real-time requirement. This part should be the main focus of efforts to improve the execution speed.

However, it should be noted that the synthesis part requires only 2 to 23% of the real-time requirement on all platforms. This means the system can easily perform real-time playback. This asymmetric type of system (i.e., a system with easy playback) has found a wide range of applications, such as broadcasting, databases, libraries, and CD-ROM based multimedia.


Table 2. Computation time requirements of some of the most demanding routines on the TMS C30.

[Surviving row labels: Codebook Search; SCB Searching Target. The numeric entries are not recoverable.]


5. DISCUSSION 


This paper has described an efficient algorithm for, and an implementation of, the CELP speech processing system. Near real-time implementation is possible using fast extraction of LSP parameters, fast searches of ACB and SCB parameters, and CELP synthesis. The codebook searches employ the joint optimization scheme, which consumes the largest block of the codebook searching computation due to a combination of the complexity of this routine and the large number of times it is called by the ACB and SCB searching algorithms. The algorithm allows high quality speech to be achieved with a bit rate as low as 4.8 kbits/s. The algorithm can be readily used for CELP implementations, such as (i) high quality low-bit rate speech transmission in point-to-point or store-and-forward (network based) mode, and (ii) efficient speech storage in speech recording or multimedia databases. We are currently seeking a hardware implementation to reduce not only the execution time but also the physical size of the actual implementation. We are studying the algorithm for the purpose of casting some of its parts into silicon. At this stage, a full hardware implementation is premature since the optimality is not clear. However, we should focus on casting the inner products and convolution processes that have become the algorithm bottleneck. Implementing the inner product process in dedicated hardware is attractive, because it has a simple computational structure, i.e., a regular multiply and accumulate process of 60 terms. The convolution of SCB elements in Eq. (39) is also attractive for hardware implementation because the SCB elements are predefined. As explained in Section 2.3.6, the ternary property simplifies the convolution into an addition/subtraction process with a branch controlled by SCB elements, making it easier to implement in hardware.